DOCUMENT RESUME

ED 276 759                                        TM 860 710

AUTHOR       Kulik, James A.; Kulik, Chen-Lin C.
TITLE        Operative and Interpretable Effect Sizes in
             Meta-Analysis.
PUB DATE     Apr 86
NOTE         24p.; Paper presented at the Annual Meeting of the
             American Educational Research Association (67th, San
             Francisco, CA, April 16-20, 1986).
PUB TYPE     Speeches/Conference Papers (150) -- Reports -
             Research/Technical (143)
EDRS PRICE   Plus Postage.
DESCRIPTORS  *Effect Size; Error of Measurement; Estimation
             (Mathematics); *Mathematical Models; *Meta Analysis;
             Pretests Posttests; Research Methodology; Social
             Science Research; Statistical Studies; Test
             Interpretation
IDENTIFIERS  *Effect Size Estimator (Glass); *Glass Analysis
             Method



ABSTRACT

     Statistical methodologists have sometimes criticized
the use of conventional statistics in meta-analysis, and in recent
years a number of them have advocated the use of a special new
statistical methodology for research synthesis. An examination of
recent books describing this methodology shows that it is seriously
limited in its applicability to social science research findings. The
new methodology produces interpretable meta-analytic results only in
exceptional circumstances (e.g., when each study in a collection uses
the same unblocked, posttest-only experimental design). The new
statistical methodology for meta-analysis has produced
uninterpretable results when applied to typical collections of social
science studies with varied experimental designs. (Author)






Operative and Interpretable Effect Sizes 
In Meta-analysis 



James A. Kulik & Chen-Lin C. Kulik 



The University of Michigan



A paper presentation at the annual meeting
of the American Educational Research Association,
San Francisco, April 1986



Effect Sizes - 1 



Abstract 

     Statistical methodologists have sometimes criticized the use
of conventional statistics in meta-analysis, and in recent years a
number of them have advocated the use of a special new statistical
methodology for research synthesis. An examination of recent
books describing this methodology shows that it is seriously
limited in its applicability to social science research findings.
The new methodology produces interpretable meta-analytic results
only in exceptional circumstances (e.g., when each study in a
collection uses the same unblocked, posttest-only experimental
design). The new statistical methodology for meta-analysis has
produced uninterpretable results when applied to typical
collections of social science studies with varied experimental
designs.






Operative and Interpretable Effect Sizes in Meta-analysis



In a classic 1976 paper Glass defined meta-analysis as the
application of statistical methods to results from a large
collection of studies for the purpose of integrating the findings.
The statistical methods that Glass used in meta-analysis were
conventional ones, such as analysis of variance and multiple
regression analysis. In meta-analysis, however, these statistics
were applied not to raw observations, but rather to effect sizes,
or standardized scores that represented the size of treatment
effects in different studies on a common scale of standard
deviation units. Hedges (1982) has recently commented on Glass's
use of conventional statistics in research synthesis:

Such use seemed at first to be an innocuous extension of
statistical methods to a new situation. However, recent
research has demonstrated that the use of such statistical
procedures as analysis of variance and regression analysis
cannot be justified for meta-analysis. Fortunately, some new
statistical procedures have been designed specifically for
meta-analysis (p. 25).

     This new methodology for meta-analysis builds on statistical
techniques originally developed by Cochran (1954) for testing the
homogeneity of results in related experiments and for making
composite estimates of population parameters from such results.
Hedges (1981, 1982) first showed how Cochran's techniques could be
applied to experimental results coded as effect sizes. Hunter,
Schmidt, and Jackson (1982) and Rosenthal (1984) later advocated
the use of formulas and tests similar to those used by Hedges.
Hedges (1984) has referred to the methods used by himself,
Rosenthal, and Hunter and Schmidt as "modern statistical
methods for meta-analysis." Bangert-Drowns (in press) has
described the methods as "approximate data pooling."

     Recent books describing this methodology (Hedges & Olkin, 1985;
Hunter, Schmidt, & Jackson, 1982; and Rosenthal, 1984) almost
entirely ignore one of the central emphases in Glass's work: the
estimation of effect sizes from studies with different
experimental designs. Glass and his colleagues have argued that
different procedures are needed for estimating effect sizes in
simple experiments with unblocked posttest-only designs and in
more complex experiments (Glass, McGaw, & Smith, 1981; McGaw &
Glass, 1980). Glass and his colleagues use the following formula,
for example, to estimate effect sizes from simple experiments that
compare posttest means of independent groups:



$$d = \frac{\bar{Y}^E - \bar{Y}^C}{S_Y}, \qquad (1)$$

where $d$ is the estimator of effect size for a specific study,
$\bar{Y}^E$ and $\bar{Y}^C$ are sample means of the experimental
group (E) and control group (C) on the dependent variable Y, and
$S_Y$ is the sample standard deviation on Y. They use different
formulas for calculating effect sizes from more complex designs
that control for irrelevant variation in the dependent variable:
comparisons of matched groups, comparisons of gain scores,
covariance analyses, multifactor analyses of variance, etc. (Glass
et al., 1981, pp. 114-123).

     The recent books on meta-analysis all give
Formula 1 as a basic formula for estimating effect sizes, but none
of the books quotes even one of the additional formulas that Glass
and his colleagues have used for calculating effect sizes from
studies with more complex experimental designs. Hedges and Olkin
(1985, p. 13), for example, have simply noted that such formulas
are outside their domain of interest; they do not consider them a
central issue in the statistics of meta-analysis. Rosenthal
(1984, pp. 30-31) has referred to these formulas as "formulas for
adjusting effect sizes" and has cautioned that those who calculate
them should also report "unadjusted effect sizes" alongside the
"adjusted" ones.

     Neither Rosenthal nor other developers of the newer
statistical methods for meta-analysis, however, have written much
about the calculation of "unadjusted" effect sizes with powerful
experimental designs. Recent books on meta-analysis focus almost
exclusively on calculation of effect sizes with an unblocked
posttest-only design. The only design other than this one covered
in detail in recent books is the comparison of pre-post gains of
experimental and control groups. What recent methodologists have
said about this design shows how different their approach is from
Glass's.



     Glass and his colleagues provided the following formula for
calculating an effect size for this design:

$$d_g = \frac{\bar{G}^E - \bar{G}^C}{S_Y}, \qquad (2)$$

where $d_g$ is the effect size estimated from a comparison of gain
scores and $\bar{G}^E$ and $\bar{G}^C$ are the average gains for the
experimental and control groups (Glass et al., 1981, pp. 115-118).
Note that the numerators of Formulas 1 and 2 are calculated
differently. The numerator of Formula 2 is calculated from group
gains rather than from group postscores. But Formulas 1 and 2 do
not differ in denominators. For both designs, the posttest standard
deviation is used in standardizing the mean differences.

     Rosenthal, on the other hand, has provided the following
formula for use with gains analyses (1984, p. 21):

$$d_R = \frac{\bar{G}^E - \bar{G}^C}{S_G}, \qquad (3)$$

where $d_R$ is the effect size estimated by Rosenthal from a
comparison of gain scores; $\bar{G}^E$ and $\bar{G}^C$ are defined
as above; and $S_G$ is the standard deviation of the gain scores and
is equal to $S_Y\sqrt{2(1 - r_{xy})}$, where $r_{xy}$ is the
correlation between pretest (X) and posttest (Y) scores. Note that
Rosenthal's and Glass's formulas differ in standardization term.
Rosenthal uses the standard deviation of gain scores in the
denominator of his formula, whereas Glass uses the standard
deviation of the posttest in his formula. Effect sizes calculated
by Glass and Rosenthal from the same gains analysis would therefore
be related by the following formula:

$$d_g = d_R\sqrt{2(1 - r_{xy})}. \qquad (4)$$

Because pretest-posttest r's can be quite high in educational and
psychological research, effect sizes calculated by Glass and
Rosenthal can differ by a large amount. For example, with a
pretest-posttest correlation of .8, an effect size calculated from
Formula 2 would be only 63% as large as an effect size calculated
from Formula 3 for the same experiment.
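
The comparison between Formula 2 and Formula 3 can be checked numerically. The sketch below is illustrative only: the function names and the gain and standard deviation values are hypothetical, and it relies on the relation stated in the text, namely that the gain-score standard deviation equals the posttest standard deviation times the square root of 2(1 - r).

```python
import math


def gain_effect_size_posttest_sd(gain_e, gain_c, sd_post):
    """Formula 2: gain difference standardized on the posttest SD."""
    return (gain_e - gain_c) / sd_post


def gain_effect_size_gain_sd(gain_e, gain_c, sd_post, r_xy):
    """Formula 3: gain difference standardized on the SD of gain scores.

    Assumes, as the text does, that SD(gain) = SD(posttest) * sqrt(2 * (1 - r_xy)).
    """
    sd_gain = sd_post * math.sqrt(2 * (1 - r_xy))
    return (gain_e - gain_c) / sd_gain


# Hypothetical study: mean gains of 8 and 3 points, posttest SD of 10, r = .8.
d_formula_2 = gain_effect_size_posttest_sd(8.0, 3.0, 10.0)
d_formula_3 = gain_effect_size_gain_sd(8.0, 3.0, 10.0, 0.8)

# The ratio depends only on r_xy and equals sqrt(2 * (1 - r_xy)).
ratio = d_formula_2 / d_formula_3
print(f"Formula 2: {d_formula_2:.3f}  Formula 3: {d_formula_3:.3f}  ratio: {ratio:.3f}")
```

Because the ratio depends only on the pretest-posttest correlation, the same shrinkage applies whatever the raw gains happen to be.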

     Kraemer and Andrews (1982), who have also contributed to the
development of new statistical methods for meta-analysis, have
criticized Formula 3 on the grounds that the standardization term
is wrong. They have pointed out that Formulas 1 and 3 do not
estimate the same quantity. What is remarkable about Kraemer
and Andrews's discussion is their assumption that meta-analysts
use Formula 3 to calculate effect sizes from comparisons
of gains. Long before Kraemer and Andrews published their
article, Glass and his colleagues had advised meta-analysts not to
use Formula 3 in calculating effect sizes for comparisons of gains
but to use Formula 2 instead. Glass's advice has been either
overlooked or ignored by recent contributors to
meta-analytic methodology.



     The purpose of this article is to review issues involved in
estimating effect sizes from studies that use different
experimental designs. It covers estimation of effect sizes for
both simple posttest-only designs and complex designs that remove
sources of irrelevant variation from the posttest. This article
also presents formulas for calculating the standard error of
effect sizes estimated under different designs. Finally, the
article shows that average effect sizes and their standard errors
are often miscalculated in meta-analyses that use the newer
statistical methods for research synthesis.

Operative and Interpretable Effect Sizes

     The notion of measuring effect sizes was a familiar one to
many social scientists long before Glass used indices of effect
sizes as a key tool in meta-analysis. Cohen's (1977) classic book
on power analysis in the behavioral sciences made extensive use of
effect sizes in estimating the power of statistical tests and in
determining sample sizes needed to achieve tests of a given power.
Cohen's book on power analysis also introduced a critical
distinction between two types of effect sizes: interpretable
effect sizes and operative effect sizes.



     When calculated from means, interpretable effect sizes are
determined by dividing a treatment effect expressed in raw units
(Y-units) by the standard deviation of Y. Cohen used the symbol $d$
to stand for the interpretable effect size for the posttest-only,
independent-group design. He added primes and subscripts to the
symbol $d$ to denote interpretable effect sizes calculated for other
experimental designs. For example, Cohen used the symbol $d'_4$ for
the interpretable effect size calculated for one sample of n
differences between paired observations (1977, p. 49):

$$d'_4 = \frac{\bar{Y} - \bar{X}}{S_Y}, \qquad (5)$$

where $\bar{X}$ is the mean of the experimental group on a pretest
and $\bar{Y}$ is the mean on the posttest. Cohen used the symbol
$d'_3$ for the interpretable effect size calculated for a study in
which the mean of the experimental group is compared to a
theoretical population mean (p. 46). Cohen pointed out, however,
that all such interpretable effect sizes are conceptually
equivalent and can be interpreted on a common scale. This is
because the standardizing unit for interpretable effect sizes is
always the standard deviation of Y ($S_Y$).

     Operative effect sizes are an entirely different matter.
Operative effect sizes are calculated by dividing a treatment
effect expressed in Y-units by either the standard deviation of Y
or a standard deviation from which sources of variation
have been removed by one or another adjustment mechanism designed
to increase power: covariance, regression, or blocking. Operative
effect sizes are identical to interpretable effect sizes only in
experiments that do not remove sources of irrelevant variation
from the dependent variable, e.g., Campbell and Stanley's (1963)
Type 6 experiments. For other experimental designs, operative
effect sizes are calculated with different standardizing units.
The operative effect size $d$ for paired observations for one
sample would be estimated from (Cohen, 1977, p. 63):



d = _ . (6) 



     Cohen used the symbol $d$, with subscripts or primes, to
represent operative effect sizes calculated for a variety of
experimental designs. Although denoted by a common symbol,
operative effect sizes calculated for different experimental
designs are not conceptually equivalent because different
standardizing units are used in calculating them. Operative
effect sizes cannot therefore be interpreted in a single way.
Operative effect sizes are useful, however, because they can be
employed directly to find values of power in power tables.

     A critical point to grasp is this: Meta-analysis must be
based on interpretable effect sizes, not operative effect sizes,
if it is to produce interpretable results. Operative effect sizes
have an undesirable property that makes them unsuitable for use
in meta-analysis. With a given raw-score unit, they vary not only
as a function of size of the raw-score treatment effect but also
as a function of the experimental design used to investigate the
effect. Two investigators studying the same phenomenon who find
identical treatment effects (when effects are expressed in raw-
score units) would report different operative effect sizes if
they used different research designs to study this treatment
effect.

     Suppose, for example, that Investigators A and B are studying
the effect of a diet-supplement program, and each investigator
finds that the program has the same effect on the dependent
variable: It adds an average of 10 pounds to the weight of
experimental subjects. Suppose further that Investigator A has
used a weak experimental design to estimate the effect and test
its significance (e.g., a posttest-only design for independent
groups), whereas Investigator B has used a more powerful design
(e.g., analysis of covariance with a covariate that correlates .9
with the dependent variable). Even though the raw treatment
effects estimated by the two investigators are of the same
magnitude, the operative effect size calculated by Investigator B
will be roughly twice the size of the operative effect size
calculated by Investigator A because the standardization term
employed by Investigator A is the raw standard deviation, whereas
the standardization term employed by Investigator B is a standard
deviation of residuals from which important sources of variation
have been removed. The interpretable effect sizes calculated by
the two investigators, however, will be the same, as they should
be for two raw treatment effects of the same magnitude.
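
The diet-supplement example can be worked through numerically. In this sketch the 20-pound standard deviation of weight is a hypothetical value not given in the text, and Investigator B's residual standard deviation is assumed to be the raw standard deviation times the square root of (1 - r squared), the usual large-sample relation for a single covariate.

```python
import math

RAW_EFFECT = 10.0      # pounds gained, as in the text
SD_WEIGHT = 20.0       # hypothetical raw SD of weight, for illustration only
R_COVARIATE = 0.9      # covariate-outcome correlation, as in the text


def interpretable_effect_size(raw_effect, sd_y):
    """Raw effect standardized on the SD of the dependent variable."""
    return raw_effect / sd_y


def operative_effect_size_ancova(raw_effect, sd_y, r):
    """Raw effect standardized on the residual SD after covariance adjustment.

    Assumes residual SD = sd_y * sqrt(1 - r**2).
    """
    return raw_effect / (sd_y * math.sqrt(1 - r ** 2))


# Investigator A (posttest-only design): operative and interpretable coincide.
operative_a = interpretable_effect_size(RAW_EFFECT, SD_WEIGHT)
# Investigator B (analysis of covariance): operative size is inflated.
operative_b = operative_effect_size_ancova(RAW_EFFECT, SD_WEIGHT, R_COVARIATE)
# Both investigators share the same interpretable effect size.
interpretable_b = interpretable_effect_size(RAW_EFFECT, SD_WEIGHT)

inflation = operative_b / operative_a
print(f"A: {operative_a:.2f}  B (operative): {operative_b:.2f}  inflation: {inflation:.2f}")
```

The inflation factor does not depend on the hypothetical standard deviation; it is determined entirely by the covariate correlation.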

     It is useful to keep this distinction between operative and
interpretable effect sizes in mind when examining Formulas 2 and
3, the formulas that Glass and Rosenthal use for calculating
effect sizes from comparisons of gains. It should be clear that
the formula used by Glass is a formula for an interpretable effect
size; Rosenthal's formula is for an operative effect size. It
should also be clear that Glass's formula is the appropriate one
to use in meta-analysis, and Rosenthal's formula is inappropriate
for use in research synthesis.



     Although Glass and his colleagues have not mentioned the
distinction between operative and interpretable effect sizes in
their writings on meta-analysis, their work suggests that they
were aware of the importance of using interpretable effect sizes,
rather than operative ones, in research synthesis (e.g., Glass et
al., 1981). The various formulas for effect size that they have
given are all formulas for interpretable effect sizes. Rosenthal
(1984), on the other hand, appears to believe that operative
effect sizes are the appropriate ones to calculate in research
synthesis. Although Kraemer and Andrews (1982)
recognize the inappropriateness of operative effect sizes, they
appear to believe that such effect sizes are the ones that meta-
analysts actually calculate.

Standard Error of an Effect Size

     One of the major contributions of Hedges to the methodology
for meta-analysis is the derivation of a formula for the standard
error of effect sizes. Hedges (1982) derived this formula for
standard error from a set of explicit assumptions about the data
being analyzed in research syntheses. Careful examination of
these assumptions will reveal, we believe, that they seldom are
met by meta-analytic data sets.

     Data in the model that underlies Hedges' formulas for
analysis of effect sizes come from a series of k independent
studies, each of which compares an experimental group (E) with a
control group (C). Hedges lets $Y_{ij}^E$ and $Y_{ij}^C$ stand for
the jth scores in the ith experiment from the experimental and
control groups, respectively. His model assumes that in a given
study i, $Y_{ij}^E$ and $Y_{ij}^C$ are normally distributed with
means $\mu_i^E$ and $\mu_i^C$ and common variance $\sigma_i^2$. The
population effect size for study i will then be given by the
following (Hedges, 1982, p. 491):

$$\delta_i = \frac{\mu_i^E - \mu_i^C}{\sigma_i}. \qquad (7)$$

     Hedges has examined the properties of Glass's estimator $d$
(Equation 1) of this population parameter. He has shown that in
small samples Glass's $d$ is a biased estimator of the population
parameter. An unbiased estimator of $\delta$ can be approximated
from the following:

$$d^U = \left(1 - \frac{3}{4(n^E + n^C - 2) - 1}\right) d, \qquad (8)$$

where $n^E$ and $n^C$ are the sample sizes of the experimental and
control groups. The sampling distribution of this unbiased
estimator is that of a noncentral t times a constant. If both $n^E$
and $n^C$ are large, the distribution of $d$ or $d^U$ is
approximately normal with $M(d) = \delta$ and

$$\sigma(d) = \sqrt{\frac{n^E + n^C}{n^E n^C} + \frac{\delta^2}{2(n^E + n^C)}}, \qquad (9)$$

where $\delta$, the population effect size, is the noncentrality
parameter (Hedges, 1982, p. 492). This is an important formula
because it can be used to approximate the sampling error of effect
sizes estimated under certain conditions.
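
A short numerical sketch of Formulas 8 and 9 follows. The function names and the example values (an observed effect size of 0.5 with 20 subjects per group) are hypothetical, and the observed effect size is used in place of the unknown population value when evaluating Formula 9, which is a common practical substitution rather than something the text prescribes.

```python
import math


def small_sample_correction(n_e, n_c):
    """Multiplier in Formula 8: 1 - 3 / (4 * (n_e + n_c - 2) - 1)."""
    return 1 - 3 / (4 * (n_e + n_c - 2) - 1)


def se_posttest_only(delta, n_e, n_c):
    """Formula 9: large-sample standard error for a posttest-only,
    independent-groups design."""
    return math.sqrt((n_e + n_c) / (n_e * n_c) + delta ** 2 / (2 * (n_e + n_c)))


# Hypothetical study: d = 0.5 with 20 subjects in each group.
d_observed, n_e, n_c = 0.5, 20, 20
d_unbiased = small_sample_correction(n_e, n_c) * d_observed
se_d = se_posttest_only(d_observed, n_e, n_c)

print(f"corrected d: {d_unbiased:.4f}  standard error: {se_d:.4f}")
```

With 40 subjects in total the correction is under 2 percent, while the standard error is large relative to the effect itself, which illustrates why the design is described as imprecise.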

     It is important to emphasize, however, that this formula
applies only to situations that meet the assumptions made by
Hedges. These assumptions include (a) independence of
experimental and control groups, and (b) assessment of results on
a dependent variable from which no sources of irrelevant variation
have been removed in the experimental design. The model fits a
type of design that Campbell and Stanley (1963) call a Type 6
design: the posttest-only design for independent groups. It is a
design from which valid inferences can be drawn about experimental
treatments, but researchers who use it do not estimate treatment
effects with great precision.

     Many social science researchers prefer to conduct experiments
that estimate treatment effects more precisely. Instead of
posttest-only designs for independent groups, they use
experimental designs and statistical techniques that remove
sources of irrelevant variation from dependent variables.
They use multiple-factor or matched-subject designs; they compare
gain scores of experimental and control groups; or they include
covariates in their statistical analyses. Such designs, rather
than the posttest-only design for independent groups, dominate the
literature of the social sciences. For example, none of the 14
studies included in a recent meta-analysis on coaching effects on
the Scholastic Aptitude Test (SAT) used a posttest-only design for
independent groups (Kulik et al., 1984). All studies located for
this meta-analysis compared gains or covariance-adjusted
postscores of experimental and control groups.



Additional Formulas for the Standard Error of Effect Sizes 



     Hedges' Formula 9 would not have yielded the correct standard
error for any of the effect sizes calculated by Kulik et al.
Additional formulas are needed to calculate standard errors from
studies comparing gain scores and studies using analysis of
covariance to measure treatment effects. Fortunately, it is easy
to derive such formulas. To do so, one takes advantage of an
algebraic relationship between (a) the formulas for calculating
average effect sizes for various experimental designs, and (b) the
t ratios used to test the statistical significance of the
treatment effects found with these designs. The relationship
between t ratios and effect size formulas has already been
demonstrated by Glass and his colleagues (Glass et al., 1981,
chapter 5).



     Effect size calculated from gains of independent groups.
When an experimenter has estimated a treatment effect by comparing
gains of independent experimental and control groups, the
significance of the effect is usually tested by a t ratio. When
the assumptions for the use of this test are met, the
effect size in the experiment can be estimated by Glass's formula
for comparing gains (Formula 2). Glass and his colleagues (Glass
et al., 1981, p. 127) have shown that an effect size calculated in
this way is equal to the following:

$$d_g = t_g\sqrt{2(1 - \rho_{xy})\left(\frac{1}{n^E} + \frac{1}{n^C}\right)}, \qquad (10)$$

where $t_g$ is the t ratio for testing a difference in gain scores
and $\rho_{xy}$ is the population correlation, ordinarily estimated
by $r_{xy}$, between the pretest (X) and the posttest (Y).

     It is easy to see from Equation 10 that the gain-score effect
size $d_g$ is equal to a constant times the t ratio. When
assumptions for the use of this statistical test have been met, the
sampling distribution of an unbiased estimator of this effect size
is that of a noncentral t times the constant. With both $n^E$ and
$n^C$ large, the distribution of $d_g$ will be approximately normal
with $M(d_g) = \delta$ and

$$\sigma(d_g) = \sqrt{2(1 - \rho_{xy})\,\frac{n^E + n^C}{n^E n^C} + \frac{\delta^2}{2(n^E + n^C)}}. \qquad (11)$$

When pretest and posttest correlate more than .5, the standard
error of a study effect calculated from gains will be smaller than
the standard error of the study effect calculated from the
posttests only.
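
The role of the .5 threshold can be seen by evaluating Formula 11 next to Formula 9. The sketch below uses hypothetical values (an effect size of 0.5 and 20 subjects per group) and hypothetical function names; it is meant only to illustrate the comparison, not to reproduce any analysis from the studies discussed.

```python
import math


def se_posttest_only(delta, n_e, n_c):
    """Formula 9: standard error for posttest-only, independent groups."""
    return math.sqrt((n_e + n_c) / (n_e * n_c) + delta ** 2 / (2 * (n_e + n_c)))


def se_gain(delta, n_e, n_c, rho_xy):
    """Formula 11: standard error for an effect size from gain scores."""
    sampling_term = 2 * (1 - rho_xy) * (n_e + n_c) / (n_e * n_c)
    return math.sqrt(sampling_term + delta ** 2 / (2 * (n_e + n_c)))


delta, n_e, n_c = 0.5, 20, 20  # hypothetical study
se_post = se_posttest_only(delta, n_e, n_c)
se_rho_30 = se_gain(delta, n_e, n_c, 0.3)  # weak pretest-posttest correlation
se_rho_50 = se_gain(delta, n_e, n_c, 0.5)  # threshold case
se_rho_80 = se_gain(delta, n_e, n_c, 0.8)  # strong correlation

for label, value in [("posttest only", se_post), ("gains, rho=.3", se_rho_30),
                     ("gains, rho=.5", se_rho_50), ("gains, rho=.8", se_rho_80)]:
    print(f"{label:15s} {value:.4f}")
```

At a correlation of exactly .5 the factor 2(1 - rho) equals 1 and the two formulas coincide; below .5 the gain-score analysis is actually less precise than the posttest-only analysis.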



     Effect sizes calculated from covariance-adjusted scores of
independent groups. Treatment effects are often tested for
significance using analysis of covariance. When assumptions are
met for use of the t ratio or F ratio for testing the difference
between the covariance-adjusted means of the experimental and
control groups, the effect size can be estimated by the
following:

$$d_c = \frac{(\bar{Y}^E - \bar{Y}^C) - b_w(\bar{X}^E - \bar{X}^C)}{S_Y}, \qquad (12)$$

where $\bar{X}^E$ and $\bar{X}^C$ are the group means on the
covariate and $b_w$ is the pooled within-group estimate of the
regression of final status Y on the covariate X. Glass and his
colleagues have shown that this effect size is equal to a constant
times the t ratio calculated for the same design (Glass et al.,
1981, p. 127):

$$d_c = t_c\sqrt{(1 - \rho_{xy}^2)\left(\frac{1}{n^E} + \frac{1}{n^C}\right)}, \qquad (13)$$

where $t_c$ is the t ratio for testing the treatment effect for this
experimental design and $\rho_{xy}$ is the population correlation,
ordinarily estimated by the sample value $r_{xy}$, between the
final-status score Y and the covariate X. The sampling distribution
of an unbiased estimator of this effect is that of a noncentral t
times the constant. With both $n^E$ and $n^C$ large, the
distribution of $d_c$ will be approximately normal with
$M(d_c) = \delta$ and

$$\sigma(d_c) = \sqrt{(1 - \rho_{xy}^2)\,\frac{n^E + n^C}{n^E n^C} + \frac{\delta^2}{2(n^E + n^C)}}. \qquad (14)$$

A comparison of Formulas 14 and 9 shows that standard errors of
effect sizes calculated from residual scores are generally smaller
than are standard errors of effect sizes calculated from posttests.


     Effect sizes estimated from posttest scores of matched
groups. When a researcher has used a matched-pairs design to test
the effect of an experimental treatment, the effect size can be
calculated from Formula 1. With such a design, the statistical
significance of the effect is usually tested with a t ratio for
correlated means. Glass has shown that this t ratio is related to
the effect size calculated for this design by the following (Glass
et al., 1981, p. 127):

$$d_m = t_m\sqrt{\frac{2(1 - \rho_{YY'})}{N}}, \qquad (15)$$

where $t_m$ is the t ratio for correlated means, N is the number of
matched pairs, and $\rho_{YY'}$ is the population correlation of the
paired posttest scores. This equation shows that an effect size
estimated from this design, $d_m$, is equal to a constant times the
t ratio calculated from the same design. The sampling distribution
of an unbiased estimate of this effect size is that of a noncentral
t times the constant. With N large, the distribution of $d_m$ will
be approximately normal with $M(d_m) = \delta$ and

$$\sigma(d_m) = \sqrt{\frac{2(1 - \rho_{YY'})}{N} + \frac{\delta^2}{2N}}. \qquad (16)$$
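
The conversions for the covariance and matched-pairs designs (Formulas 13 through 16) follow the same pattern: a design-specific constant times the t ratio, with a standard error whose first term is the square of that constant. The sketch below illustrates this under hypothetical inputs; the function names, t ratios, sample sizes and correlations are all made up for the example.

```python
import math


def d_from_ancova_t(t, n_e, n_c, rho_xy):
    """Formula 13: effect size from a covariance-analysis t ratio."""
    return t * math.sqrt((1 - rho_xy ** 2) * (1 / n_e + 1 / n_c))


def se_ancova(delta, n_e, n_c, rho_xy):
    """Formula 14: standard error for the covariance design."""
    sampling_term = (1 - rho_xy ** 2) * (n_e + n_c) / (n_e * n_c)
    return math.sqrt(sampling_term + delta ** 2 / (2 * (n_e + n_c)))


def d_from_matched_t(t, n_pairs, rho_pairs):
    """Formula 15: effect size from a t ratio for correlated means."""
    return t * math.sqrt(2 * (1 - rho_pairs) / n_pairs)


def se_matched(delta, n_pairs, rho_pairs):
    """Formula 16: standard error for the matched-pairs design."""
    return math.sqrt(2 * (1 - rho_pairs) / n_pairs + delta ** 2 / (2 * n_pairs))


# Hypothetical covariance study: t = 2.0, 25 per group, covariate r = .6.
d_c = d_from_ancova_t(2.0, 25, 25, 0.6)
se_c = se_ancova(d_c, 25, 25, 0.6)

# Hypothetical matched-pairs study: t = 2.0, 20 pairs, pair correlation .5.
d_m = d_from_matched_t(2.0, 20, 0.5)
se_m = se_matched(d_m, 20, 0.5)

print(f"covariance design:    d = {d_c:.3f}, SE = {se_c:.3f}")
print(f"matched-pairs design: d = {d_m:.3f}, SE = {se_m:.3f}")
```

In both cases the observed effect size stands in for the population value when the standard error is evaluated, which is an assumption of this sketch.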

Implications for Meta-analyses in the Literature

     The lack of explicitness in recent books on meta-analysis
about the need for a variety of formulas for effect sizes and
standard errors has led to flawed analyses and flawed conclusions
in the research literature. We focus here on three meta-analyses
that have used the newer statistical methods for research
synthesis. The reviews were selected for special examination on
the basis of the detail that they contain. The first and second
reviews (Becker, 1983; Rosenthal & Rubin, 1982) list effect sizes
and standard errors for individual studies. The third review
(Pearlman, 1984) contains a numerical calculation of cumulated
sampling error in a set of effect sizes listed in a report by
Kulik et al. (1984). The three reviews contain enough detail for
readers to reconstruct what was actually done in the analyses.



     Interpersonal expectancies. To illustrate the utility of
their method of research synthesis, Rosenthal and Rubin (1978,
1982) applied their test of homogeneity to a set of
findings on interpersonal expectancy reported originally in a
dissertation by Keshock (1971). Subjects in Keshock's
dissertation were 48 black inner-city boys in grades 2 through 5.
Half the children at each grade level were assigned to the
experimental treatment, and their teachers were told that these
children showed an ability level one standard deviation higher
than their actual scores. The teachers of the control children
were given the children's actual ability scores. Keshock reported
that this experimental treatment had no significant effect on the
children's achievement scores, as measured by the Wide Range
Achievement Test (WRAT).

     Rosenthal and Rubin concluded that Keshock's analysis was
inadequate because it failed to take into account grade-level
differences in treatment effects. They reanalyzed Keshock's data
using the WRAT gain scores that Keshock listed in an appendix to
his dissertation. Rosenthal and Rubin reported their conclusions
succinctly:



Gains in performance were substantially greater for the
children whose teachers had been led to expect greater gains
in performance. The sizes of the effects varied across the
four grades from nearly half a standard deviation to nearly
four standard deviations. For all subjects combined, the
mean effect size was 2.04. (P. 383).

     It is not at all obvious how Keshock could have missed an
effect of this magnitude. Cohen (1977) has given rough guidelines
for interpreting effect sizes. According to Cohen, effects of
about 0.8 standard deviations should be considered large; they can
usually be detected by eye without the aid of special measuring
tools. Although Rosenthal and Rubin's reanalysis showed an
average effect size of 2.04, Keshock had not noticed any effect of
teacher expectations on WRAT scores. Nor did his original
statistical analysis disclose such effects. How could Keshock
have failed to note so large an effect?

     How also could Keshock's experimental treatment have produced
an effect of this magnitude? Teachers of children in the
experimental group were simply told that their children had IQs
one standard deviation higher than their actual IQs. According to
Rosenthal and Rubin, this simple manipulation raised achievement
scores of experimental-group children by an average of two
standard deviations. Scores of second graders rose by an average
of almost four standard deviations. These are enormous gains.
One standard deviation on an IQ scale is equal to 15 points, for
example; a gain of two standard deviations on such a scale is
equivalent to 30 points, and a gain of four standard deviations is
equivalent to 60 points.

     Rosenthal and Rubin's results are not so paradoxical as they
may at first appear to be. Rosenthal and Rubin calculated effect
sizes from Keshock's t ratios for testing the significance of a
difference in gains, but they did not use the appropriate formula
(Equation 10). Instead, they converted the t ratios to effect
sizes using an equation appropriate for the t test for comparing
final-status scores:

$$d = t\sqrt{\frac{1}{n^E} + \frac{1}{n^C}}. \qquad (17)$$

Rosenthal and Rubin therefore expressed the treatment effects for
Keshock's study not in terms of variation in achievement but
rather in terms of variation in achievement gains. Such effect
sizes are not interpretable on the same scale as are other effect
sizes.
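
The size of the distortion introduced by applying Formula 17 to a gain-score t ratio, instead of Equation 10, can be sketched as follows. The t ratio and group sizes below are hypothetical, and the correlation of .93 is used only as an illustrative pretest-posttest value; none of these numbers is taken from Keshock's data.

```python
import math


def d_final_status(t, n_e, n_c):
    """Formula 17: conversion appropriate for a t test on final-status scores."""
    return t * math.sqrt(1 / n_e + 1 / n_c)


def d_gain(t, n_e, n_c, rho_xy):
    """Equation 10: conversion appropriate for a t test on gain scores."""
    return t * math.sqrt(2 * (1 - rho_xy) * (1 / n_e + 1 / n_c))


t_gain = 3.0          # hypothetical t ratio from a comparison of gains
n_e = n_c = 6         # hypothetical group sizes
rho_xy = 0.93         # illustrative pretest-posttest correlation

d_wrong = d_final_status(t_gain, n_e, n_c)   # Formula 17 applied to a gain t
d_right = d_gain(t_gain, n_e, n_c, rho_xy)   # Equation 10

# The overstatement depends only on rho_xy: 1 / sqrt(2 * (1 - rho_xy)).
overstatement = d_wrong / d_right
print(f"Formula 17: {d_wrong:.2f}  Equation 10: {d_right:.2f}  factor: {overstatement:.2f}")
```

The higher the pretest-posttest correlation, the larger the overstatement, so highly reliable achievement tests are exactly where the wrong conversion does the most damage.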

     Conventional effect sizes can be estimated easily for
Keshock's experiment. Scores on the WRAT reading and arithmetic
tests are standardized scores with a mean of 100 and a standard
deviation of 15. Raw treatment effects reported by Keshock on the
WRAT were 5.54 points in reading and 5.91 points in arithmetic.
These gains are analogous to increases of this magnitude on an IQ
scale. Just as it would be misleading to refer to a gain of 5 IQ
points as representing 2.04 standard deviations, it is misleading
to refer to a gain of this size on the WRAT as equivalent to an
effect size of 2.04. The average effect size on the WRAT was 0.37
standard deviations in reading and 0.39 standard deviations in
arithmetic.
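
The arithmetic behind these conventional effect sizes is a single division of each raw treatment effect by the WRAT standard deviation of 15. A brief sketch, with variable names chosen for illustration:

```python
WRAT_SD = 15.0  # standard deviation of WRAT standardized scores, as stated in the text

# Raw treatment effects reported for Keshock's study, in WRAT points.
raw_effects = {"reading": 5.54, "arithmetic": 5.91}

# Interpretable effect size: raw effect divided by the test's standard deviation.
effect_sizes = {test: points / WRAT_SD for test, points in raw_effects.items()}
rounded = {test: round(value, 2) for test, value in effect_sizes.items()}

for test, value in rounded.items():
    print(f"{test}: {value:.2f} standard deviations")
```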

     Table 1 compares our estimated effect sizes for Keshock's
data with estimates made by Rosenthal and Rubin. Our calculations
of composite effect sizes are based on a reported correlation of
.7 between arithmetic and reading scores on the WRAT. Our
estimated standard errors are based on retest correlations of .93
for WRAT composite scores.



     Gender and susceptibility to influence. Becker's (1983)
effect size synthesis on this topic used Hedges' (1982)
homogeneity approach. More than 100 studies in this area were
originally located by Eagly, who used both a box-score approach
and meta-analytic methodology to integrate the results of the
studies (Eagly, 1978; Eagly & Carli, 1981). Eagly's analyses
showed that males and females differed significantly in
susceptibility to influence in the three areas of persuasion,
conformity, and group pressure. Becker thought, however, that a
reanalysis of Eagly's data was needed. To her, Eagly's box-score
approach seemed clearly inadequate, and even Eagly and Carli's
meta-analytic methods seemed to be ad hoc.

     Becker checked individually each of the effect sizes
reported by Eagly and Carli, and she reported that they were quite
accurate for the most part. Only a handful of the nearly 100
effect sizes recalculated by Becker differed by as much as 0.10
from Eagly and Carli's original estimates. Becker used her own
recoded effect sizes rather than Eagly and Carli's estimates in
her reanalysis of results in this area.

     Becker's reanalysis led her to question Eagly and Carli's
conclusions. Like Eagly and Carli, Becker found average
differences between males and females in susceptibility to
influence, but she did not attach much weight to this finding.
More important to her was her observation that variation in study
effects was more closely related to study methodology than
to indicators of sex bias. Becker pointed out, for example, that
variation in results in persuasion studies was related to type
of outcome variable used in the study (a methodological feature)
rather than to such factors as gender of the investigator, gender
bias in message, and so on. Studies that used postscores on an
attitude measure as the outcome variable (Group I studies)
produced near-zero effects; studies that used change scores,
covariance-adjusted scores, or their equivalent (Group II studies)
produced more sizeable effects.



It is possible to evaluate Becker's conclusions because she 
provided a figure showing effect sizes and standard errors of 
these effects for the 36 persuasion studies. Comparing Becker's 
statistics with results in the original reports shows that she 
calculated operative effect sizes for all studies, no matter what 
type of experimental design or dependent variable was used in a 
study. Such effect sizes do not estimate the same quantities for 
different research designs. 

Becker's calculations led her to reach the following 
conclusion: 

Methodological considerations rather than features 
representing sex bias explain the variability in persuasion 
study results. The [...] stable sex difference for persuasion 
studies that had been noted by Maccoby and Jacklin (1974) 
seems to be largely spurious. Given the [...] for 
the size of the differences with this methodological 
artifact, claims of sex bias must hold less sway. (p. 13) 

It seems to us that it is not a methodological characteristic of 
the studies that explains the different effect sizes in Group I 
and Group II experiments. It is rather Becker's method of 
calculating effect sizes that explains her finding. Group I 
treatment effects were standardized on a final-status measure, 
$S_y$; Group II treatment effects were standardized on measures of 
gain, $S_g = S_y \sqrt{2(1 - r_{xy})}$, or on residual measures, 
$S_{y \cdot x} = S_y \sqrt{1 - r_{xy}^2}$. Given the difference in the 
units on which Group I and II treatment effects are standardized, 
meaningful conclusions cannot be drawn from an analysis of the 
combined data set. 
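
The size of this artifact can be illustrated numerically. The 
following is a minimal sketch, not Becker's procedure or ours: it 
assumes equal pretest and posttest standard deviations and uses a 
hypothetical raw mean difference, standard deviation, and pre-post 
correlation. The function names and all numbers are illustrative 
assumptions. 

```python
import math


def gain_sd(s_y: float, r_xy: float) -> float:
    """SD of gain scores, assuming equal pretest and posttest SDs."""
    return s_y * math.sqrt(2.0 * (1.0 - r_xy))


def residual_sd(s_y: float, r_xy: float) -> float:
    """SD of posttest scores after linear regression on the pretest."""
    return s_y * math.sqrt(1.0 - r_xy ** 2)


# Hypothetical values, chosen only for illustration.
mean_difference = 2.0  # treatment minus control, in raw score units
s_y = 10.0             # SD of the final-status measure
r_xy = 0.80            # pretest-posttest correlation

# The same raw difference, standardized on three different units.
es_final_status = mean_difference / s_y
es_gain = mean_difference / gain_sd(s_y, r_xy)
es_residual = mean_difference / residual_sd(s_y, r_xy)

print(f"standardized on final-status SD: {es_final_status:.2f}")  # 0.20
print(f"standardized on gain-score SD:   {es_gain:.2f}")          # 0.32
print(f"standardized on residual SD:     {es_residual:.2f}")      # 0.33
```

Under these assumptions an identical treatment effect looks more 
than half again as large when it is standardized on gain or 
residual scores, which is the pattern that separates Group I from 
Group II studies. 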

Coaching effects. Pearlman's (1984) effect size synthesis 
was based on 38 studies of coaching effects originally analyzed by 
Kulik et al. (1984). Kulik and his colleagues had concluded from 
their meta-analysis of results reported in the studies that 
coaching programs in general have positive effects on aptitude 
test performance. They cautioned, however, that results differed 
in two literatures on coaching: the literature on the SAT and the 
literature on other aptitude tests. Kulik and his co-workers were 
unable to explain much of the variability in effect sizes within 
these two literatures, and they concluded that "it was impossible 
to explain fully why coaching results differ from study to study 
as much as they do" (p. 187). 

Pearlman analyzed the data assembled by Kulik and his co- 
workers using the methodology for effect size synthesis developed 
by Hunter et al. (1982). Pearlman's analysis led to conclusions 
that differed from Kulik's on all major points. Pearlman 
concluded, for example, that coaching effects are small for both 
the SAT and for other tests. Most important, Pearlman concluded 
that a substantial portion of the observed variation in effect 
sizes was attributable to a statistical artifact, sampling error, 
rather than to true differences in results from different studies. 
Pearlman interpreted his results as supporting the conclusion that 
between-study differences in coaching effectiveness are much less 
extensive than they appear to be. 

Close examination of Pearlman's calculations shows that they 
are seriously flawed. These flaws can be seen most clearly in 
Pearlman's analysis of results from 14 SAT studies included in the 
total pool of 38 studies. Pearlman calculated the total sampling 
error in the 14 studies using a cumulation formula (Hunter et al., 
1982, p. 102) that is closely related to Formula 9 and is 
appropriate for calculating sampling error with independent-group, 
posttest-only experimental designs. None of the 14 SAT studies 
used such a design, however, and none of the 14 effect sizes for 
these studies was calculated from such formulas as 1 or 17. The 
formulas that Kulik and his co-workers used to estimate effect 
sizes were formulas for interpretable effect sizes: Formulas 2, 
10, 12, 13, and 15. The appropriate formulas for calculating 
standard errors of these effect sizes are Formulas 11, 14, and 16. 



The results of Pearlman's failure to take experimental design 
into account are serious. He estimated that [...] of the 
variance in the distribution of SAT study effects could be 
attributed to sampling errors in the individual studies. If he 
had taken into account the fact that all 14 SAT studies used pre- 
post designs and that SAT retests correlate .88, he would have had 
to conclude that sampling error of effect sizes in individual 
studies could not account for more than [...] of the 
variation in study effects. Because some of the SAT studies used 
additional covariates, matched-pairs designs, and factorial 
designs that further reduced sampling errors, the true proportion 
of study effect variation attributable to sampling error may be 
even lower. 
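
The direction and rough size of this correction can be sketched 
numerically. The example below is an illustration under stated 
assumptions, not a reproduction of Formula 9, Formula 11, or 
Pearlman's computation. It uses the usual large-sample variance of 
a standardized mean difference for an independent-group, posttest- 
only design, and it assumes that in a pre-post (gain score) design 
with effects standardized on the raw-score standard deviation the 
leading term shrinks by the factor 2(1 - r). The sample sizes and 
effect size are hypothetical; only the retest correlation of .88 
comes from the text. 

```python
def var_posttest_only(d: float, n1: int, n2: int) -> float:
    """Large-sample sampling variance of d, independent groups, posttest only."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2.0 * (n1 + n2))


def var_pre_post(d: float, n1: int, n2: int, r: float) -> float:
    """Assumed approximation for a gain-score design: the leading term
    is multiplied by 2 * (1 - r), where r is the retest correlation."""
    leading = 2.0 * (1.0 - r) * (n1 + n2) / (n1 * n2)
    return leading + d ** 2 / (2.0 * (n1 + n2))


# Hypothetical study: 50 coached and 50 uncoached students, d = 0.15.
d, n1, n2 = 0.15, 50, 50
retest_r = 0.88  # SAT retest correlation reported in the text

v_post = var_posttest_only(d, n1, n2)
v_prepost = var_pre_post(d, n1, n2, retest_r)
ratio = v_prepost / v_post

print(f"posttest-only sampling variance: {v_post:.5f}")
print(f"pre-post sampling variance:      {v_prepost:.5f}")
print(f"ratio (pre-post / posttest-only): {ratio:.2f}")
```

With a retest correlation of .88 the factor 2(1 - r) equals 0.24, 
so under these assumptions the posttest-only formula overstates 
the sampling variance of such a study roughly fourfold. 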

Conclusions 

Although many researchers have found Glass's meta-analytic 
methodology appealing and useful, some methodologists have 
criticized the statistics that underlie this methodology. Hedges 
and Olkin (1982), for example, have described Glass's procedures 
as ad hoc and generally inappropriate. They have also asserted 
that hundreds of meta-analyses based on Glass's model have 
used statistics that [...] are 
demonstrably incorrect (Hedges & Olkin, 1985, p. ii). "The 
conclusions of these meta-analyses may indeed be correct," Hedges 
and Olkin have written, "but the statistical reasoning in support 
of these conclusions is not" (1985, p. 14). 






Hedges and Olkin (1985), Hunter, Schmidt, and Jackson (1982), 
and Rosenthal (1984) have in recent books tried to firm up the 
statistical basis of meta-analysis. Careful examination of their 
books shows, however, that they fail to distinguish between 
interpretable and operative effect sizes, and they do not provide 
guidelines on the calculation of interpretable effect sizes for 
studies with different experimental designs. Their failure to 
consider the influence of experimental design on effect size 
calculation seriously limits the utility of their work. 

It is no surprise to find that users of the statistics 
advocated in these recent books have produced works with serious 
flaws: 

1. These meta-analyses have often been based on operative rather 
than interpretable effect sizes. Because operative effect sizes 
are usually calculated with reduced standardization terms, 
operative effect sizes are inappropriate for use in meta-analysis. 
When operative effect sizes rather than interpretable ones are 
used in research syntheses, an artifactual relationship arises 
between effect sizes and experimental designs, and average effect 
sizes often become seriously inflated. 

2. These meta-analyses have often relied on inaccurate 
standard errors. Methodologists who have written about standard 
errors of effect sizes have presented only one formula for 
calculating such standard errors. This formula gives accurate 
results only when applied to studies using an unblocked, posttest- 
only design. The formula produces inaccurate results when applied 
to studies that estimate effects with greater precision, that is, 
to the majority of studies in the social sciences. 
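
How an overstated standard error distorts a test of homogeneity 
can be sketched as follows. The effect sizes and sampling 
variances are hypothetical, and the statistic is the familiar 
weighted sum of squared deviations from the weighted mean, 
written here as an illustration rather than as the exact procedure 
of any of the authors cited above. 

```python
def homogeneity_q(effects, variances):
    """Weighted sum of squared deviations from the weighted mean effect."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return sum(w * (d - mean) ** 2 for w, d in zip(weights, effects))


# Four hypothetical interpretable effect sizes.
effects = [0.05, 0.20, 0.35, 0.50]

# Hypothetical sampling variances: one set appropriate to the designs,
# one set four times too large (as a posttest-only formula might give
# when applied to precise pre-post designs).
accurate_variances = [0.01] * 4
overstated_variances = [0.04] * 4

q_accurate = homogeneity_q(effects, accurate_variances)
q_overstated = homogeneity_q(effects, overstated_variances)

# Critical value of chi-square with 3 degrees of freedom at the .05 level.
CHI2_CRITICAL_DF3 = 7.815

print(f"Q with accurate variances:   {q_accurate:.2f}")    # 11.25
print(f"Q with overstated variances: {q_overstated:.2f}")  # 2.81
print("homogeneity rejected (accurate):  ", q_accurate > CHI2_CRITICAL_DF3)
print("homogeneity rejected (overstated):", q_overstated > CHI2_CRITICAL_DF3)
```

In this hypothetical case the same four effects appear 
heterogeneous when their precision is represented correctly and 
homogeneous when their sampling variances are overstated. 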

We believe that valid conclusions cannot be drawn from meta- 
analyses in which effect sizes are miscalculated. Nor can valid 
conclusions be drawn when meta-analysts use inaccurate standard 
errors to test the homogeneity of collections of effect sizes. We 
are therefore more pessimistic in our assessment of recent meta- 
analytic work than Hedges and Olkin were in their evaluation of 
earlier meta-analytic results. We believe that both the 
statistical reasoning and the conclusions are likely to be 
incorrect in studies that have used the newer statistical methods 
for research synthesis. 






References 



Bangert-Drowns, R. L. (in press). A review of developments in 
meta-analytic method. Psychological Bulletin. 



Becker, B. J. (1983, April). Influence again: A comparison of 
methods for meta-analysis. Paper presented at the annual 
meeting of the American Educational Research Association, 
Montreal, Canada. 

Campbell, D. T., & Stanley, J. C. (1963). Experimental and 
quasi-experimental designs for research on teaching. In 
N. L. Gage (Ed.), Handbook of research on teaching. 
Chicago: Rand McNally. 



Cochran, W. G. (1954). The combination of estimates from 
different experiments. Biometrics, 10, 101-129. 

Cohen, J. (1977). Statistical power analysis for the behavioral 
sciences (Rev. ed.). New York: Academic Press. 

Eagly, A. H. (1978). Sex differences in influenceability. 
Psychological Bulletin, 85, 86-116. 



Eagly, A. H., & Carli, L. L. (1981). Sex of researchers and 
sex-typed communications as determinants of sex differences in 
influenceability: A meta-analysis of social influence 
studies. Psychological Bulletin, 90, 1-20. 



Glass, G. V. (1976). Primary, secondary, and meta-analysis of 
research. Educational Researcher, 5, 3-8. 



Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in 
social research. Beverly Hills: Sage Publications. 

Hedges, L. V. (1981). Distribution theory for Glass's estimator 
of effect size and related estimators. Journal of 
Educational Statistics, 6, 107-128. 



Hedges, L. V. (1982). Estimation of effect size from a series of 
independent experiments. Psychological Bulletin, 92(2), 490- 
499. 

Hedges, L. V., & Olkin, I. (1982). Analyses, reanalyses, and 
meta-analysis. Contemporary Education Review, 1, 157-165. 

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta- 
analysis. Orlando, FL: Academic Press. 

Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). 
Meta-analysis: Quantitative methods for cumulating research 
findings across studies. San Francisco: Sage Publications. 



Keshock, J. D. (1971). An investigation of the effects of the 
expectancy phenomenon upon the intelligence, achievement and 
motivation of inner-city elementary school children. 
Dissertation Abstracts International, 32, 01-A (University 
Microfilms No. 71-19,010). 



Kraemer, H. C., & Andrews, G. (1982). A nonparametric technique 
for meta-analysis effect size calculation. Psychological 
Bulletin, 91, 404-412. 



Kulik, J. A., Bangert, R. L., & Kulik, C.-L. C. (1984). 
Effectiveness of coaching for aptitude tests. Psychological 
Bulletin, 95, 179-188. 

McGaw, B., & Glass, G. V. (1980). Choice of the metric for 
effect size in meta-analysis. American Educational Research 
Journal, 17, 325-337. 



Pearlman, K. (1984, August). Validity generalization: 
Methodological and substantive implications for meta-analytic 
research. Paper presented at the annual meeting of the 
American Psychological Association, Toronto, Canada. 



Rosenthal, R. (1984). Meta-analytic procedures for social 
research. Beverly Hills, CA: Sage Publications. 

Rosenthal, R., & Rubin, D. B. (1978). Interpersonal expectancy 
effects: The first 345 studies. The Behavioral and Brain 
Sciences, 1, 377-386. 

Rosenthal, R., & Rubin, D. B. (1982). Comparing effect sizes of 
independent studies. Psychological Bulletin, 92(2), 500-504. 






Author Note 

The material in this report is based upon work supported by the 
National Science Foundation under Grant No. 8470258. Any opinions, 
findings, and conclusions or recommendations expressed in this 
report are those of the authors and do not necessarily reflect the 
views of the National Science Foundation. 

Requests for reprints should be sent to James A. Kulik, 
Center for Research on Learning and Teaching, The University of 
Michigan, 109 E. Madison St., Ann Arbor, Michigan 48109. 






Table 1 

Comparison of Effect Sizes in Keshock's (1970) Study 
Estimated by Different Formulas 

                              Effect Size 

           Estimated by                  Estimated from 
           Rosenthal & Rubin (1982)      Formulas 2 and 11 

Grade        M         SD                  M         SD 

2           3.85      1.03                0.89      0.22 
3           2.34      0.78                0.53      0.22 
4           0.47      0.58                0.11      0.22 
5           1.48      0.66                0.34      0.22 

Mean        2.04      0.73                0.44      0.11 



